ABSTRACT

This paper contends that, apparently, the patterns of
above-norm performance that J. Cannell (1987) reported exist,
although the extremity of his "Lake Wobegon" effect is perhaps
overstated. The use of outdated test norms and other practices has
been identified as a partial explanation for the reporting of
"above-average" scores. Two questions arise: (1) how can the use of
commercial norm-referenced tests be changed to lead to more accurate
reporting?; and (2) which standards for reporting test results are
reasonable? Recommendations are made for improving test use, which
center on the following areas: (1) improving documentation; (2) more
frequent norming; and (3) use of multiple test forms. It is important
to put the reporting of results in context within states, districts,
and schools. The problem has generally been that results are reported
too simplistically. Three types of data are generally necessary to
contextualize test results: longitudinal trends, performance
distributions, and subgroup comparisons (by race/ethnicity or
socioeconomic status, for example). To report achievement test data
more comprehensively, better yardsticks and better use of them are
needed. An eight-item list of references is included. An appendix
contains a sample performance distribution graph. (SLD)
Questions about the meaning of reported achievement test results and
whether the public is being misled are serious matters. Regardless of what
educational professionals believe and want, virtually the entire non-educational
sector (politicians, business community, parents, media) views testing as a valid and
useful means of monitoring educational progress and sees tests as viable tools in
holding institutions and individuals directly accountable. They want to know the
"truth" about how American students and schools are achieving.

The dilemma, as professional educators know, is that there is no one truth
when it comes to assessing student achievement. Using the same measures to
monitor progress (in the sense of trying to keep abreast of where we stand) and to
hold specific educational units accountable raises the spectre of corrupting the
meaning of the measures. Changing performance and changing achievement aren't
synonymous, as Linn, Graue, and Sanders (1990) and Shepard (1990) remind us. If we
become too obsessed with measuring accurately the average performance of the
students nationally, regionally, and locally, we can do a disservice to the educational
improvement effort.

Apparently, the patterns of above-norm performance that John Cannell
(1987) reported are there, although the extremity of the "Lake Wobegon" effect is
perhaps overstated. Linn et al. (1990) point to the use of dated norms as a partial
explanation for the high scores. Shepard (1990) identifies other practices that
likely inflate reported test results. Two questions come to mind about what to do in
light of this evidence. First, how can the use of commercial norm-referenced tests as
measures of student achievement in high-stakes testing contexts be changed to lead
to more accurate public reporting of results? Second, regardless of how these tests or
their use are changed, what are reasonable standards for "fully" reporting test results
to "inform" educational improvement?



Changes in Test Use 

The first question generates a laundry list of suggested modifications, both
mild and strong. The list includes at least the following:

1. Accurately describe the norm group and tests administered in all
documents and reports and properly "educate" the lay reader regarding the specific
meaning of "average" being used by the state or district.

2. Fully describe all systematic exclusions of students from the tested group
in the school/district/state and the likely consequences of these exclusions with
regard to comparisons to the norms used.

3. Fully describe test administration and security procedures and their likely
consequences with respect to comparisons to norms.

4. Develop guidelines/sanctions regarding appropriate and inappropriate
test preparation and report evidence on local adherence to guidelines.

5. Report performance according to "annual user norms" to discourage the
practice of comparison to dated norms.

6. Renorm tests more frequently (annually or biannually) and report
performance only with respect to "new norm" data.

7. Use multiple commercial tests administered randomly throughout schools
and districts each year to reduce the "benefits" of teaching to a specific test.



8. Develop multiple alternative forms and administer alternative forms 
randomly throughout schools and districts each year. 

Documentation 

Recommendations one through four above are essentially mild changes in
the provision of information about what test results represent. Yet as Shepard's
(1990) results indicate, there is certainly little uniformity across states and districts in
reporting this information. From a cursory examination of the state reports
provided for the Linn et al. (1990) analyses, none of the states reports details on all
four areas (norms description, sample exclusion, administration instructions, test
preparation guidelines). However, there are exemplary practices in documentation,
as illustrated by an appendix to North Carolina's report which includes a glossary of
terminology used and a discussion of the choice of reporting metric and what other
alternatives were contemplated and rejected.

Clearly, brief, glossy, graphically presented results are more likely to attract
positive response from the public and policy communities. Such priorities dictate
against including "messy" details in the body of annual district and state reports. 
Nevertheless, public documentation of practices, either through technical 
appendices or supplementary reports, should be routine practice as a means to 
improve the public policy discussion. 

Frequent Norming

Recommendations five through eight require stronger changes that can be
prohibitively expensive. Obviously, annual user norms are not hard to generate nor 
necessarily very expensive. On the other hand, annual or biannual norming, if 
done properly and carefully, would be unnecessarily burdensome for everyone 
involved. 

What is troubling about either of the norming recommendations is that in
many respects, both miss the mark of the purpose of annual collection and
reporting of test results. These results are supposed to measure the status and
progress of the school system. But either recommendation would change the metric
for reporting frequently or add additional metrics to contend with. The standard for
comparison itself becomes a frequently moving target devoid of anchoring. Trends
in performance would be restricted to cross-sectional ones at a given point in time.
Thus one would be replacing one bad signal (inflated performance due to test 
familiarity and dated norms) with another. 

Multiple Forms 

The use of multiple alternative test forms, either from a single publisher or
from multiple publishers, would improve matters, since problems associated with
"teaching to the test" would be lessened. However, the domain represented in
these multiple forms may still be a lean one, and thus susceptible to corrosive
(beating the test) testing practices. Moreover, it is unlikely that publishers would
make the investments necessary to expand the number of forms they offer due to 
prohibitive cost; nor would they be likely to encourage the kinds of cooperative 
behavior that districts and states would need to administer tests from multiple 
publishers on a random basis. 

Actually, one of the unfortunate consequences of Cannell's fixation on the
Lake Wobegon-type results from commercial achievement tests is that it detracted
attention from those states that employ alternative assessment strategies. Some
states (e.g., California) annually administer many more test items targeted to specific
curriculum areas and use this information to monitor performance and progress over
time. Item sampling techniques originally developed and used in the National
Assessment of Educational Progress (NAEP; no student takes all items but all items are
administered to a random sample of students within schools and districts) yield a
considerable amount of additional curricular detail. For example, California
administered 186 mathematics test items at grade 8 from 1983-84 through 1985-86
covering 8 broader skill areas and 33 more specific ones with at least 12 items per
area at the most specific level. Any school or district that can teach to this
potentially broad a domain of content should raise student achievement as well as
test performance.
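
The item sampling idea described above can be illustrated with a minimal Python sketch. It is not the procedure used by NAEP or by California; it only shows the general principle that no student takes every item while every item is still administered. The function name, the block size of 31 items, and the count of 60 students are assumptions made for illustration; only the pool size of 186 items comes from the text.

```python
import random

def assign_item_blocks(item_ids, student_ids, items_per_student, seed=0):
    """Spread an item pool over students so that no student takes every item."""
    rng = random.Random(seed)
    items = list(item_ids)
    rng.shuffle(items)
    # Cut the shuffled pool into disjoint blocks (short test forms).
    blocks = [items[i:i + items_per_student]
              for i in range(0, len(items), items_per_student)]
    students = list(student_ids)
    rng.shuffle(students)
    # Rotate through the blocks so each block goes to a random subset of students.
    return {s: blocks[k % len(blocks)] for k, s in enumerate(students)}

# 186 items as in the California example; 60 students and 31 items per
# student are hypothetical values chosen only for this sketch.
assignment = assign_item_blocks(range(186), range(60), items_per_student=31)
covered = {item for block in assignment.values() for item in block}
```

In this sketch each student answers 31 items, yet the whole pool is covered as long as there are at least as many students as blocks. Results can then be summarized for the school or district at the level of skill areas rather than for individual students.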

This testing technology, in newer, more curricularly and instructionally
relevant forms, should be dominating state and even district testing practices
intended for monitoring and reporting progress. But it isn't, at least not in
comprehensive assessment activities; rather, domain-oriented assessment in schools
seems to have been restricted to competency and proficiency testing of students
rather than monitoring system performance.

For those concerned about normative data, work by Bock and Mislevy (1988)
offers a means of scaling data from comprehensive assessments in ways that can
anchor results to a given time point and, under certain conditions, can represent
score scales with the types of content tasks achievable at a given score level. The
California Assessment Program has been employing these methods since the
mid-1980s and NAEP under the Educational Testing Service (ETS) has increasingly relied
on variations of such reporting practices.

But simply expanding the number of items administered and using fancier
analytical technology will not make test reporting rosy. Nevertheless, debating the
technical shortcomings of these types of alternatives to current commercially
available tests may be a better use of time than to continue introducing costly "fixes"
of commercial test results for monitoring educational progress.



"Full" Contextualized Reporting

The root of Cannell's concern seems to be that somebody (various state and
district officials, test publishers, or both) is intentionally deluding the public by
reporting above average performance and harming children by falsely telling them
(and their parents) that they are doing okay. While Cannell may be on the right
track, his message is potentially as limiting as the practices he decries. Frankly, if all
a state or district does is report the percent of their students above the national
norm in a given year, the results are misleading, regardless of the test used and
testing procedures. In district, state, national, or international testing programs, the
practice of dwelling only on system-level average performance is simplistic, wasteful,
counter-productive, and invariably misleading. To the degree that Cannell's
obsession with state and district average performance detracts from the efforts to
more comprehensively report their students' performance, his challenges
perpetuate the worst, rather than the best, of large-scale assessment.

My point is that the single-minded concentration on the central tendency in 
states and districts is misguided regardless of who is doing it. It is important to 
contextualize the reporting of results within states, districts, and schools. 
Complaining that states and districts are using an easily misinterpreted metric for 
reporting when in fact the problem is how simplistically data are reported is
misplaced rigor and piety. 

As part of a feasibility study on using existing data collected by the states to
construct education indicators for state-by-state comparisons of student performance
at the national level (Burstein, Baker, Aschbacher, & Keesling, 1986), we urged that
auxiliary information about students and schools be used to contextualize the
description of educational performance within states (and other educational units).
Our analysis of state reports of assessment results (primarily for the years 1982-84)
indicated that while a remarkable variety of interesting information (background,
resources, curriculum and instructional activities at the student, school, and district
levels) was being collected, there was little comparability in the collection and
reporting of auxiliary information.

But in the ongoing development efforts regarding state level education
indicators, the concern for contextualizing any achievement comparisons has
become virtually axiomatic. The National Assessment Planning Project conducted by
the Council of Chief State School Officers (CCSSO, 1988) devoted 5 of their 12
recommendations and well over half the report to advocating the reporting of (a)
distributions of scores within states, (b) cross-sectional trends as changes in the
proportions of students at specified proficiency levels, (c) subgroup reporting, (d)
rankings by demographic variables, and (e) relating achievement to education
variables. Likewise, 4 of the 12 recommendations from the NAEP Technical Review
Panel report (Haertel, 1988) address similar concerns about moving beyond the
reporting of system averages.

What kinds of information should states and districts be using to
contextualize their reports of test results? In broad terms, three types of data
come to mind: longitudinal trends, performance distributions (e.g., percentage
scoring in each quarter) within and among schools/districts, and subgroup
comparisons (e.g., by ethnicity/race, SES, gender, community type, language status,
resource and curricular subgroups) and their cross-classifications (e.g., longitudinal
trends in the proportion of Hispanic students in urban schools within the state who
score above the 25th percentile nationally).
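
As a rough illustration of why a performance distribution says more than a single "percent above the norm" figure, the following Python sketch tabulates the share of students in each national quarter for each subgroup. The function names, the subgroup labels, and the records are hypothetical and invented for this sketch; national percentile ranks are assumed to be whole numbers from 1 to 99.

```python
from collections import defaultdict

def quarter_shares(percentile_ranks):
    """Percent of scores in each national quarter, lowest quarter first."""
    counts = [0, 0, 0, 0]
    for p in percentile_ranks:
        # Ranks 1-25 fall in quarter 0, 26-50 in 1, 51-75 in 2, 76-99 in 3.
        counts[min(3, int((p - 1) // 25))] += 1
    total = len(percentile_ranks)
    return [100.0 * c / total for c in counts]

def percent_above_norm(percentile_ranks):
    """Single summary figure: percent scoring above the 50th percentile."""
    above = sum(1 for p in percentile_ranks if p > 50)
    return 100.0 * above / len(percentile_ranks)

def shares_by_group(records):
    """records: iterable of (subgroup_label, national_percentile_rank)."""
    by_group = defaultdict(list)
    for group, p in records:
        by_group[group].append(p)
    return {g: quarter_shares(ps) for g, ps in by_group.items()}

# Hypothetical records, invented for illustration only.
records = [("A", 10), ("A", 30), ("A", 55), ("A", 80),
           ("B", 60), ("B", 70), ("B", 20), ("B", 15)]
report = shares_by_group(records)
above_a = percent_above_norm([p for g, p in records if g == "A"])
above_b = percent_above_norm([p for g, p in records if g == "B"])
```

In this invented example both subgroups have half of their students above the national median, yet subgroup B has half of its students in the bottom quarter while subgroup A is spread evenly. Repeating the same tabulation for successive years would give the longitudinal view.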

Taken in isolation, each of these types of information can be misleading and
misused in much the same way as Cannell (1987) claims that overall state and district 
achievement test results have been. However, when combined, they provide a 
more accurate depiction of the performance of students in the nation's school 
systems. Moreover, publicly reporting achievement data in this more 
comprehensive and informative way would encourage better testing practices and 
public policy discussions about testing results. To use an old colloquialism, it is hard 
to hide one's dirty linen when it is all hanging on the clothesline.

State and district officials were not explicitly asked to provide information 
about trend, distribution, and subgroup reporting in the CRESST follow-up of the 
Cannell study. Nevertheless, the state reports obtained by Linn et al. (1990) were 
examined to determine whether these types of reporting practices had expanded
and improved since our earlier study. The overall picture is still a mixed one. 
Generally, the practice of more refined reporting of assessment data has expanded 
somewhat with well over half the states reporting one of the three types of 
information emphasized here. With respect to trends, for example, California juxtaposes
trends from different subject areas on the same graph while South Carolina displays 
the percent of students falling within each national quarter over time for three 
grade levels. The latter display also illustrates attention to the distribution of 
performance within the state rather than total absorption with the average.

Washington's 1990 General Report (Washington Department of Education,
1990) illustrates what can and should be done in reporting performance distributions
within a state. Figure 11 (see Appendix A) from that report presents performance
distributions for students categorized by ethnic/minority status. This display uses
box-and-whisker plots, a very compact and graphically appealing means of 
conveying distributional data. The body of the report provides a succinct and clear 
explanation of the technique. 
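
For readers unfamiliar with the technique, a box-and-whisker plot is usually drawn from a five-number summary of each group's scores. The Python sketch below computes one such summary per group. It is only an illustration: the scores and group labels are invented, and the quartile convention used here (the "inclusive" method of the standard library) is an assumption, since the convention used in the Washington report is not stated in this paper.

```python
import statistics

def five_number_summary(scores):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return (min(scores), q1, median, q3, max(scores))

# Hypothetical scores by subgroup, invented for illustration only.
scores_by_group = {
    "Group A": [5, 1, 4, 2, 3],
    "Group B": [10, 20, 30, 40, 50],
}
summaries = {g: five_number_summary(s) for g, s in scores_by_group.items()}
```

The box spans the lower to the upper quartile with a line at the median, and the whiskers extend toward the extremes. Placing the boxes for several groups side by side conveys both level and spread at a glance.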



Final Comment



I am enthusiastic about efforts to report achievement test data more 
comprehensively and generally unsympathetic toward Cannell's single-mindedness 
on the question of misleading reporting. There are already too many pressures to 
oversimplify matters. In the case of Washington, Cannell (1987) managed to notice
only the single number representing the state average among the myriads of
displays and discussions that attempted to document how the students within the 
state were doing. It does a disservice to educational officials to ignore such efforts. 
Moreover, it undoubtedly slows down progress on more important education quality 
reporting to divert all attention to a particular limiting feature of the educational 
achievement yardsticks. We need both better yardsticks and better use of them. 
FIGURE 11 - GRADE 4
Distributions of Ethnic/Minority Students' Scores
on MAT6 Total Reading, October 1989